Chronological Sampling for Email Filtering
Abstract
User models for email filtering should be developed from appropriate training and test sets. In the literature, k-fold cross-validation is commonly presented as the method for producing these data sets, and it mixes old and new messages. We show that this yields overly optimistic estimates of the email filter's accuracy in classifying future messages, because the test set has a higher probability of containing messages that are similar to those in the training set. We propose the k-fold chronological cross-validation method, which preserves the chronology of the email messages in the test set.
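The abstract does not spell out the splitting procedure, so the following is only a minimal sketch of one plausible reading: messages are ordered by timestamp, cut into k + 1 contiguous blocks, and each fold trains on the older blocks and tests on the next one. The function name `chronological_folds`, the forward-chaining scheme, and the block sizes are assumptions made for illustration, not the authors' published method.

```python
from typing import List, Sequence, Tuple


def chronological_folds(
    timestamps: Sequence[float], k: int
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs in which no test message
    is older than any training message (hypothetical helper)."""
    n = len(timestamps)
    if k < 1 or n < k + 1:
        raise ValueError("need k >= 1 and at least k + 1 messages")

    # Order message indices from oldest to newest (ties keep input order).
    order = sorted(range(n), key=lambda i: timestamps[i])

    # Cut the ordered indices into k + 1 contiguous, near-equal blocks.
    bounds = [j * n // (k + 1) for j in range(k + 2)]
    blocks = [order[bounds[j]:bounds[j + 1]] for j in range(k + 1)]

    folds = []
    for j in range(1, k + 1):
        # Train on every block older than block j, test on block j itself.
        train = [i for block in blocks[:j] for i in block]
        folds.append((train, blocks[j]))
    return folds


if __name__ == "__main__":
    # Six messages with unordered arrival times, split into two folds.
    for train, test in chronological_folds([5, 1, 4, 2, 6, 3], k=2):
        print("train:", train, "test:", test)
```

Unlike ordinary k-fold cross-validation, a split of this kind never places a message in the training set that arrived after a test message, which is the property the abstract attributes to the chronological variant.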
Similar resources
Spam Corpus Creation for TREC
TREC’s Spam Filtering Track (Cormack & Lynam, 2005) introduces a standard testing framework that is designed to model a spam filter’s usage as closely as possible, to measure quantities that reflect the filter’s effectiveness for its intended purpose, and to yield repeatable (i.e. controlled and statistically valid) results. The TREC Spam Filter Evaluation Toolkit is free software that, given a...
Visualizing Rhythms of Intimacy in Email Communication
Visualizing Rhythms of Intimacy in Email Communication. (December 2004) Mirko Mandic, B.S., Charleston Southern University. Chair of Advisory Committee: Dr. Andruid Kerne. Email has developed into one of the most extensively used computer applications. Email interfaces, on the other hand, have gone through very few transformations since their inception. As the growing volumes of email data accumu...
Utilizing Multi-Field Text Features for Efficient Email Spam Filtering
Large-scale spam emails cause a serious waste of time and resources. This paper investigates the text features of email documents and the feature noises among multi-field texts, resulting in an observation of a power law distribution of feature strings within each text field. According to the observation, we propose an efficient filtering approach including a compound weight method and a lightw...
Active Sampling for Rank Learning via Optimizing the Area under the ROC Curve
Learning ranking functions is crucial for solving many problems, ranging from document retrieval to building recommendation systems based on an individual user’s preferences or on collaborative filtering. Learning-to-rank is particularly necessary for adaptive or personalizable tasks, including email prioritization, individualized recommendation systems, personalized news clipping services and ...
Artificial Immune System for Collaborative Spam Filtering
Artificial immune systems (AIS) use the concepts and algorithms inspired by the theory of how the human immune system works. This document presents the design and initial evaluation of a new artificial immune system for collaborative spam filtering. Collaborative spam filtering allows for the detection of not-previously-seen spam content, by exploiting its bulkiness. Our system uses two novel a...